library(tidyverse)
library(ggplot2)
Before starting, I am loading the person-level data I created, and I am going to recode some variables to make labeling axes easier later on.
transpo_data <- read.csv("transpo_data.csv") %>%
mutate(meansTW = case_when(
JWTR_label=="Bicycle"~"Bicycle",
JWTR_label=="Bus or trolley bus"~"Bus or\ntrolley bus",
JWTR_label=="Car, truck, or van"~"Car, truck,\nor van",
JWTR_label=="Motorcycle"~"Motorcycle",
JWTR_label=="Other method"~"Other",
JWTR_label=="Railroad"~"Railroad",
JWTR_label=="Streetcar or trolley car (carro publico in Puerto Rico)"~"Streetcar or\ntrolley car",
JWTR_label=="Subway or elevated"~"Subway or\nelevated",
JWTR_label=="Taxicab"~"Taxicab",
JWTR_label=="Walked"~"Walk")) %>%
mutate(edu = case_when(
SCHL_label=="Some college, but less than 1 year"~"<1 year of college",
SCHL_label=="Regular high school diploma"~"High school diploma",
SCHL_label=="Bachelor's degree"~"Bachelor's degree",
SCHL_label=="1 or more years of college credit, no degree"~"> 1 yr of college,\nno degree",
SCHL_label=="Master's degree"~"Master's degree",
SCHL_label=="GED or alternative credential"~"GED",
SCHL_label=="Doctorate degree"~"Doctorate degree",
SCHL_label=="Associate's degree"~"Associate's degree",
SCHL_label=="Grade 8"~"Grade 8",
SCHL_label=="Grade 7"~"Grade 7",
SCHL_label=="Grade 10"~"Grade 10",
SCHL_label=="12th grade - no diploma"~"12th grade -\nno diploma",
SCHL_label=="Grade 6"~"Grade 6",
SCHL_label=="Grade 11"~"Grade 11",
SCHL_label=="Professional degree beyond a bachelor's degree"~"Professional degree",
SCHL_label=="Grade 9"~"Grade 9",
SCHL_label=="No schooling completed"~"No schooling",
SCHL_label=="Grade 5"~"Grade 5",
SCHL_label=="Grade 4"~"Grade 4",
SCHL_label=="Nursery school, preschool"~"Preschool",
SCHL_label=="Grade 3"~"Grade 3",
SCHL_label=="Grade 1"~"Grade 1",
SCHL_label=="Kindergarten"~"Kindergarten",
SCHL_label=="Grade 2"~"Grade 2"))
In this assignment, I had 5 continuous variables, 1 binary categorical variable, and 2 categorical variables, one with 10 unique categories and the other with 24 unique categories. This means that I ran 10 Pearson’s correlation tests, 5 2-sample t-tests, 10 ANOVA tests, and 3 chi-square tests.
My sample size remains the same as my last assignment: 6914 people in Colorado who pay rent, and spend time commuting to work.
veh_inc_correlation <- cor.test(transpo_data$vehicle, transpo_data$PINCP)
veh_inc_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$vehicle and transpo_data$PINCP
## t = -3.7988, df = 6912, p-value = 0.0001466
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06914258 -0.02209666
## sample estimates:
## cor
## -0.04564493
options(scipen = 999)
ggplot(transpo_data, aes(x = PINCP, y = vehicle)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Income") +
scale_y_continuous(name = "Number of Vehicles Accessible",
breaks = seq(0, 6, by = 1),
labels = paste(prettyNum(seq(0, 6, by = 1),
big.mark = ","))) +
annotate(geom = "text", x = 0, y = 7,
label = paste("Correlation (95-percent confidence):",
prettyNum(veh_inc_correlation$conf.int[1], digits = 2),
"to",
prettyNum(veh_inc_correlation$conf.int[2], digits = 2)),
hjust = 0)
The correlation between income and number of vehicles accessible is surprisingly weak with an r-value of -0.04564493 , but still statistically significant (p = 0.0001466). This could be driven by the fact that some of the outliers in the income variable report having access to only 1 or 2 vehicles. It’s also possible that regardless of how much income one earns, most homes in the Northern Colorado area only have one- to two-car garages.
veh_time_correlation <- cor.test(transpo_data$vehicle, transpo_data$JWMNP)
veh_time_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$vehicle and transpo_data$JWMNP
## t = 2.5957, df = 6912, p-value = 0.009459
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.007639928 0.054738125
## sample estimates:
## cor
## 0.03120635
options(scipen = 999)
ggplot(transpo_data, aes(x = JWMNP, y = vehicle)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Travel Time to Work") +
scale_y_continuous(name = "Number of Vehicles Accessible",
breaks = seq(0, 6, by = 1),
labels = paste(prettyNum(seq(0, 6, by = 1),
big.mark = ","))) +
annotate(geom = "text", x = 0, y = 7,
label = paste("Correlation (95-percent confidence):",
prettyNum(veh_time_correlation$conf.int[1], digits = 2),
"to",
prettyNum(veh_time_correlation$conf.int[2], digits = 2)),
hjust = 0)
The correlation between travel time to work and number of vehicles accessible is weak, with an r-value of 0.03120635. It is statistically significant, though (p = 0.009459).
veh_rent_correlation <- cor.test(transpo_data$vehicle, transpo_data$GRNTP)
veh_rent_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$vehicle and transpo_data$GRNTP
## t = 25.663, df = 6912, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2732750 0.3163199
## sample estimates:
## cor
## 0.2949471
options(scipen = 999)
ggplot(transpo_data, aes(x = GRNTP, y = vehicle)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Gross Rent") +
scale_y_continuous(name = "Number of Vehicles Accessible",
breaks = seq(0, 6, by = 1),
labels = paste(prettyNum(seq(0, 6, by = 1),
big.mark = ","))) +
annotate(geom = "text", x = 0, y = 7,
label = paste("Correlation (95-percent confidence):",
prettyNum(veh_rent_correlation$conf.int[1], digits = 2),
"to",
prettyNum(veh_rent_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is the strongest one I’ve found yet, with an r-value of 0.2949471. It is also statistically significant. It’s interesting to note that the trend line increases until about a rent of $3000, after which it decreases.
veh_age_correlation <- cor.test(transpo_data$vehicle, transpo_data$AGEP)
veh_age_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$vehicle and transpo_data$AGEP
## t = -10.427, df = 6912, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1475839 -0.1011695
## sample estimates:
## cor
## -0.1244448
options(scipen = 999)
ggplot(transpo_data, aes(x = AGEP, y = vehicle)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Age") +
scale_y_continuous(name = "Number of Vehicles Accessible",
breaks = seq(0, 6, by = 1),
labels = paste(prettyNum(seq(0, 6, by = 1),
big.mark = ","))) +
annotate(geom = "text", x = 0, y = 7,
label = paste("Correlation (95-percent confidence):",
prettyNum(veh_age_correlation$conf.int[1], digits = 2),
"to",
prettyNum(veh_age_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is weak with an r-value of -0.1244448. It is still statistically significant. The trend line suggests a general decline in the number of vehicles accessible as age increases, which makes some sense since senior citizens might choose to drive less as they get older.
income_time_correlation <- cor.test(transpo_data$PINCP, transpo_data$JWMNP)
income_time_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$PINCP and transpo_data$JWMNP
## t = 3.2073, df = 6912, p-value = 0.001346
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01499023 0.06206430
## sample estimates:
## cor
## 0.03854865
options(scipen = 999)
ggplot(transpo_data, aes(x = PINCP, y = JWMNP)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Income") +
scale_y_continuous(name = "Travel Time to Work in Minutes") +
annotate(geom = "text", x = 0, y = 170,
label = paste("Correlation (95-percent confidence):",
prettyNum(income_time_correlation$conf.int[1], digits = 2),
"to",
prettyNum(income_time_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is weak (r = 0.03854865) though still statistically significant (p = 0.001346). The trendline is almost exactly flat, which means that a change in income does not necessarily predict a change in travel time to work.
income_rent_correlation <- cor.test(transpo_data$PINCP, transpo_data$GRNTP)
income_rent_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$PINCP and transpo_data$GRNTP
## t = 20.869, df = 6912, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2211605 0.2655116
## sample estimates:
## cor
## 0.2434633
options(scipen = 999)
ggplot(transpo_data, aes(x = PINCP, y = GRNTP)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Income") +
scale_y_continuous(name = "Gross Rent") +
annotate(geom = "text", x = 0, y = 4300,
label = paste("Correlation (95-percent confidence):",
prettyNum(income_rent_correlation$conf.int[1], digits = 2),
"to",
prettyNum(income_rent_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is fairly weak, though stronger than several of the ones I’ve found so far (r = 0.2434633). It is also statistically significant (p < 2.2e-16). This general relationship should be somewhat expected: the higher your income, the higher rent you can afford.
income_age_correlation <- cor.test(transpo_data$PINCP, transpo_data$AGEP)
income_age_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$PINCP and transpo_data$AGEP
## t = 17.34, df = 6912, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1814715 0.2266514
## sample estimates:
## cor
## 0.2041702
options(scipen = 999)
ggplot(transpo_data, aes(x = AGEP, y = PINCP)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "AGE") +
scale_y_continuous(name = "INCOME",
breaks = seq(0, 600000, by = 100000)) +
annotate(geom = "text", x = 0, y = 550000,
label = paste("Correlation (95-percent confidence):",
prettyNum(income_age_correlation$conf.int[1], digits = 2),
"to",
prettyNum(income_age_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is fairly weak (r = 0.2041702), though it is statistically significant. The general trend makes some sense: it’s expected that one’s income will grow as they get older.
time_rent_correlation <- cor.test(transpo_data$GRNTP, transpo_data$JWMNP)
time_rent_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$GRNTP and transpo_data$JWMNP
## t = 4.4475, df = 6912, p-value = 0.000008822
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02988474 0.07689436
## sample estimates:
## cor
## 0.05341915
options(scipen = 999)
ggplot(transpo_data, aes(x = GRNTP, y = JWMNP)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Gross Rent") +
scale_y_continuous(name = "Travel Time to Work in Minutes") +
annotate(geom = "text", x = 0, y = 170,
label = paste("Correlation (95-percent confidence):",
prettyNum(time_rent_correlation$conf.int[1], digits = 2),
"to",
prettyNum(time_rent_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is weak (r = 0.05341915) though still statistically significant (p < 0.5). The trendline is almost exactly flat, which means that a change in rent does not necessarily predict a change in travel time to work.
time_age_correlation <- cor.test(transpo_data$AGEP, transpo_data$JWMNP)
time_age_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$AGEP and transpo_data$JWMNP
## t = 3.4264, df = 6912, p-value = 0.0006152
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01762312 0.06468731
## sample estimates:
## cor
## 0.04117806
options(scipen = 999)
ggplot(transpo_data, aes(x = AGEP, y = JWMNP)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Age") +
scale_y_continuous(name = "Travel Time to Work in Minutes") +
annotate(geom = "text", x = 0, y = 170,
label = paste("Correlation (95-percent confidence):",
prettyNum(time_age_correlation$conf.int[1], digits = 2),
"to",
prettyNum(time_age_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is weak (r = 0.04117806) though still statistically significant (p = 0.0006152). The trendline is almost exactly flat, which means that a change in age does not necessarily predict a change in travel time to work.
age_rent_correlation <- cor.test(transpo_data$AGEP, transpo_data$GRNTP)
age_rent_correlation
##
## Pearson's product-moment correlation
##
## data: transpo_data$AGEP and transpo_data$GRNTP
## t = -5.0605, df = 6912, p-value = 0.0000004289
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08420749 -0.03723733
## sample estimates:
## cor
## -0.06075605
options(scipen = 999)
ggplot(transpo_data, aes(x = AGEP, y = GRNTP)) +
geom_point(size = 0.5, alpha = 0.25, position = "jitter") +
geom_smooth() +
theme_bw() +
scale_x_continuous(name = "Age") +
scale_y_continuous(name = "Rent") +
annotate(geom = "text", x = 0, y = 5000,
label = paste("Correlation (95-percent confidence):",
prettyNum(age_rent_correlation$conf.int[1], digits = 2),
"to",
prettyNum(age_rent_correlation$conf.int[2], digits = 2)),
hjust = 0)
This correlation is weak (r = -0.06075605) though still statistically significant (p < 0.05). The trendline is almost flat, with a slight decrease, which might suggest (not robustly) that as one gets older, they pay less for rent.
By running this two sample t-test, I am attempting to answer: do males have access to more vehicles than females do?
veh_difference = t.test(vehicle ~ SEX_label == "Male",
data = transpo_data)
veh_difference
##
## Welch Two Sample t-test
##
## data: vehicle by SEX_label == "Male"
## t = -3.4203, df = 6795.3, p-value = 0.0006293
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.13663313 -0.03707391
## sample estimates:
## mean in group FALSE mean in group TRUE
## 1.878664 1.965517
Males generally have access to more vehicles than females do. Males generally have access to 1.97 cars while females generally have access to 1.88 cars. This is statistically significant (p = 0.0006293). We can visualize this result with a box plot.
ggplot(transpo_data, aes(x = SEX_label, y = vehicle)) +
geom_boxplot() +
theme_bw() +
scale_x_discrete(name = "Sex",
labels = c("Female", "Male")) +
scale_y_continuous(name = "Number of Vehicles Accessible",
breaks = seq(0, 6, by = 1),
labels = paste(prettyNum(seq(0, 6, by = 1),
big.mark = ","))) +
annotate(geom = "text", x = 1.5, y = 3, size = 3,
label = paste("Difference in means\n(95% confidence):\n",
prettyNum(abs(veh_difference$conf.int[2]),
digits = 3), " to ",
prettyNum(abs(veh_difference$conf.int[1]),
digits = 3),
sep = "")) +
annotate(geom = "text", x = 1, y = veh_difference$estimate[1], size = 3,
label = paste("Mean (95% confidence):",
prettyNum(veh_difference$estimate[1], digits = 3))) +
annotate(geom = "text", x = 2, y = veh_difference$estimate[2], size = 3,
label = paste("Mean (95% confidence):",
prettyNum(veh_difference$estimate[2], digits = 3)))
Although the difference in mean number of vehicles accessible by sex is statistically significant, it’s practically not interesting since you can only have whole cars. Either value would likely be rounded up to 2 for practical purposes.
By running this two sample t-test, I am attempting to answer: do males earn a higher income than females?
income_difference = t.test(PINCP ~ SEX_label == "Male",
data = transpo_data)
income_difference
##
## Welch Two Sample t-test
##
## data: PINCP by SEX_label == "Male"
## t = -11.907, df = 6705, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13157.602 -9437.736
## sample estimates:
## mean in group FALSE mean in group TRUE
## 36613.52 47911.19
Males generally earn a higher income than females do. Males, on average, earn 47,911.19 dollars while females earn 36,613.52 dollars. This is statistically significant (p < 0.05). We can visualize this result with a violin plot.
ggplot(transpo_data, aes(x = SEX_label, y = PINCP)) +
geom_violin() +
theme_bw() +
scale_x_discrete(name = "Sex",
labels = c("Female", "Male")) +
scale_y_continuous(name = "Income",
breaks = seq(0, 500000, by = 50000),
labels = paste("$",
prettyNum(seq(0, 500000, by = 50000),
big.mark = ","))) +
annotate(geom = "text", x = 1.5, y = 250000, size = 3,
label = paste("Difference in means\n(95% confidence):\n$",
prettyNum(abs(income_difference$conf.int[2]),
digits = 0), " to $",
prettyNum(abs(income_difference$conf.int[1]),
digits = 0),
sep = "")) +
annotate(geom = "text", x = 1, y = income_difference$estimate[1], size = 3,
label = paste("Mean (95% confidence):\n$",
prettyNum(income_difference$estimate[1], digits = 0))) +
annotate(geom = "text", x = 2, y = income_difference$estimate[2], size = 3,
label = paste("Mean (95% confidence):\n$",
prettyNum(income_difference$estimate[2], digits = 0)))
With the outliers in the data, the difference in means doesn’t appear to be great, but according to the two-sample t-test, we can be about 95% confident that males earn between 9,438 and 13,158 more dollars than females, on average.
By running this two sample t-test, I am attempting to answer: do males have a longer travel time to work then females do?
time_difference = t.test(JWMNP ~ SEX_label == "Male",
data = transpo_data)
time_difference
##
## Welch Two Sample t-test
##
## data: JWMNP by SEX_label == "Male"
## t = -4.5831, df = 6871.9, p-value = 0.000004663
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.425799 -1.373147
## sample estimates:
## mean in group FALSE mean in group TRUE
## 23.48692 25.88639
Though just by little, males generally travel for a longer time than females when going to work. Males, on average, take 25.89 minutes to get to work, while females take 23.49 minutes to get to work. This is statistically significant (p < 0.05). We can visualize this result with a violin plot.
ggplot(transpo_data, aes(x = SEX_label, y = JWMNP)) +
geom_violin() +
theme_bw() +
scale_x_discrete(name = "Sex",
labels = c("Female", "Male")) +
scale_y_continuous(name = "Travel time to work in minutes",
breaks = seq(0, 150, by = 15),
labels = paste(prettyNum(seq(0, 150, by = 15),
big.mark = ","))) +
annotate(geom = "text", x = 1.5, y = 75, size = 3,
label = paste("Difference in means\n(95% confidence):\n",
prettyNum(abs(time_difference$conf.int[2]),
digits = 3), " to ",
prettyNum(abs(time_difference$conf.int[1]),
digits = 3),
sep = "")) +
annotate(geom = "text", x = 1, y = time_difference$estimate[1], size = 3,
label = paste("Mean (95% confidence):\n",
prettyNum(time_difference$estimate[1], digits = 3))) +
annotate(geom = "text", x = 2, y = time_difference$estimate[2], size = 3,
label = paste("Mean (95% confidence):\n",
prettyNum(time_difference$estimate[2], digits = 3)))
With the outliers in the data, the difference in means doesn’t appear to be great. Indeed, on average, males only travel for about 2.4 minutes longer than females to work.
By running this two sample t-test, I am attempting to answer: do males payer higher rents then females do?
rent_difference = t.test(GRNTP ~ SEX_label == "Male",
data = transpo_data)
rent_difference
##
## Welch Two Sample t-test
##
## data: GRNTP by SEX_label == "Male"
## t = -1.1739, df = 6737.1, p-value = 0.2405
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -43.95413 11.02898
## sample estimates:
## mean in group FALSE mean in group TRUE
## 1457.158 1473.620
The difference in means is not statstically significant (p = 0.2405). This means that while the mean rent for males is 1473.62 and the mean rent for females is 1457.16, we can’t be at least 95% confident that this difference didn’t just occur by chance. In fact, if we look at a violin plot, we can see that the distributions and means are almost identical.
ggplot(transpo_data, aes(x = SEX_label, y = GRNTP)) +
geom_violin() +
theme_bw() +
scale_x_discrete(name = "Sex",
labels = c("Female", "Male")) +
scale_y_continuous(name = "Rent",
breaks = seq(0, 5000, by = 500),
labels = paste("$",
prettyNum(seq(0, 5000, by = 500),
big.mark = ","))) +
annotate(geom = "text", x = 1.5, y = 2500, size = 3,
label = paste("Difference in means:\n$",
prettyNum(abs(rent_difference$conf.int[2]),
digits = 0), " to $",
prettyNum(abs(rent_difference$conf.int[1]),
digits = 0),
sep = "")) +
annotate(geom = "text", x = 1, y = rent_difference$estimate[1], size = 3,
label = paste("Mean:\n$",
prettyNum(rent_difference$estimate[1], digits = 0))) +
annotate(geom = "text", x = 2, y = rent_difference$estimate[2], size = 3,
label = paste("Mean:\n$",
prettyNum(rent_difference$estimate[2], digits = 0)))
By running this two sample t-test, I am attempting to answer: are males older than females? I have a strong suspicion that this difference in means will not be statistically significant.
age_difference = t.test(AGEP ~ SEX_label == "Male",
data = transpo_data)
age_difference
##
## Welch Two Sample t-test
##
## data: AGEP by SEX_label == "Male"
## t = 0.50473, df = 6593.2, p-value = 0.6138
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4504762 0.7628869
## sample estimates:
## mean in group FALSE mean in group TRUE
## 35.97951 35.82331
As I predicted, the difference in means of ages between males and females is not statistically significant (p = 0.6138). We cannot be at least 95% sure that the difference didn’t just occur by chance. This makes sense, as nowadays sex (at least in aggregate - this likely is different if we examine sex and race or sex and socioeconomic status) doesn’t necessarily affect how old one grows to be.
To see if there’s a statistically significant association between the number of vehicles one has access to and one’s means of transportation, we can do an analysis of variance.
means_veh_anova <- aov(vehicle ~ meansTW, data = transpo_data)
summary(means_veh_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## meansTW 9 294 32.72 30.47 <0.0000000000000002 ***
## Residuals 6904 7413 1.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is much less than 0.05, so we can be 95% confident that there is a significant association between the means of transportation one takes to work and the number of vehicles one has access to. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
means_veh_differences <- TukeyHSD(means_veh_anova)
as_tibble(cbind(pair = row.names(means_veh_differences$meansTW),
means_veh_differences$meansTW))
## # A tibble: 45 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "Bus or\ntrolley bus-… -0.48204621… -0.84823583… -0.1158566… 0.0012956125404…
## 2 "Car, truck,\nor van-… 0.285170995… -0.00662020… 0.57696219… 0.0618462911305…
## 3 "Motorcycle-Bicycle" 0.474321705… -0.39488040… 1.34352381… 0.7802322642717…
## 4 "Other-Bicycle" -0.04104714… -0.55062276… 0.46852847… 0.99999992127209
## 5 "Railroad-Bicycle" -0.30692829… -0.95456985… 0.34071326… 0.8926766984664…
## 6 "Streetcar or\ntrolle… -0.31317829… -1.38965382… 0.76329723… 0.99584851056228
## 7 "Subway or\nelevated-… -0.27200182… -0.90419689… 0.36019324… 0.9386794293834…
## 8 "Taxicab-Bicycle" -0.60791513… -1.41375884… 0.19792856… 0.3334113735028…
## 9 "Walk-Bicycle" -0.37207042… -0.71077471… -0.0333661… 0.0183251405639…
## 10 "Car, truck,\nor van-… 0.767217214… 0.538081481… 0.99635294… 0.0000000000042…
## # … with 35 more rows
We can visualize these differences with a box plot.
ggplot(transpo_data, aes(x = meansTW, y = vehicle)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Means of Transportation to Work") +
scale_y_continuous(name = "Number of Accessible Vehicles",
breaks = seq(0, 6, by = 1),
labels = paste(prettyNum(seq(0, 6, by = 1))))
I find it interesting that the Bus or Trolley Bus category has the largest distribution of accessible vehicles. I also find it interesting that the Motorcycle category also has a larger distribution than the Car, truck, or van category.
means_income_anova <- aov(PINCP ~ meansTW, data = transpo_data)
summary(means_income_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## meansTW 9 34404094709 3822677190 2.302 0.0141 *
## Residuals 6904 11467191892516 1660948999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between the means of transportation one takes to work and one’s income. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
means_income_differences <- TukeyHSD(means_income_anova)
as_tibble(cbind(pair = row.names(means_income_differences$meansTW),
means_income_differences$meansTW))
## # A tibble: 45 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "Bus or\ntrolley bus-Bicy… -2109.26649… -16511.4577… 12292.9248… 0.999985048…
## 2 "Car, truck,\nor van-Bicy… 5853.218674… -5622.89231… 17329.3296… 0.841631122…
## 3 "Motorcycle-Bicycle" 2008.929263… -32176.6801… 36194.5386… 0.999999995…
## 4 "Other-Bicycle" 8669.021476… -11372.5235… 28710.5665… 0.936671450…
## 5 "Railroad-Bicycle" 18481.42926… -6990.23168… 43953.0902… 0.391441517…
## 6 "Streetcar or\ntrolley ca… 21217.05426… -21120.5926… 63554.7011… 0.855287534…
## 7 "Subway or\nelevated-Bicy… 5267.642498… -19596.5099… 30131.7949… 0.999662834…
## 8 "Taxicab-Bicycle" 487.0542635… -31206.6771… 32180.7857… 1
## 9 "Walk-Bicycle" 503.8297737… -12817.3673… 13825.0269… 0.999999999…
## 10 "Car, truck,\nor van-Bus … 7962.485165… -1049.39435… 16974.3646… 0.137745555…
## # … with 35 more rows
I can also represent these differences with a violin plot.
ggplot(transpo_data, aes(x = meansTW, y = PINCP)) +
geom_violin() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Means of Transportation to Work") +
scale_y_continuous(name = "Income",
breaks = seq(0, 500000, by = 50000),
labels = paste("$",
prettyNum(seq(0, 500000, by = 50000),
big.mark = ",")))
I find it interesting that the few outliers that exist in the data set are not only found in the Car, truck, or van category (which makes sense, given that private vehicles are relatively expensive), but they are also found in the Bus or trolley bus and Walk categories (which are fairly cheap or free).
means_time_anova <- aov(JWMNP ~ meansTW, data = transpo_data)
summary(means_time_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## meansTW 9 174081 19342 42.58 <0.0000000000000002 ***
## Residuals 6904 3136051 454
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between the means of transportation and the time it takes to get to work. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
means_time_differences <- TukeyHSD(means_time_anova)
as_tibble(cbind(pair = row.names(means_time_differences$meansTW),
means_time_differences$meansTW))
## # A tibble: 45 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "Bus or\ntrolley bus-B… 28.7219540… 21.1902783… 36.2536298… 0.00000000000417…
## 2 "Car, truck,\nor van-B… 7.19085532… 1.18938340… 13.1923272… 0.00585046150564…
## 3 "Motorcycle-Bicycle" -0.1153100… -17.992792… 17.7621725… 1
## 4 "Other-Bicycle" 21.9728046… 11.4920093… 32.4535999… 0.00000000157607…
## 5 "Railroad-Bicycle" 21.8221899… 8.50169676… 35.1426830… 0.00000986190328…
## 6 "Streetcar or\ntrolley… 33.2596899… 11.1190710… 55.4003087… 0.00008850596362…
## 7 "Subway or\nelevated-B… 25.6126310… 12.6098366… 38.6154255… 0.00000002165434…
## 8 "Taxicab-Bicycle" -1.1876784… -17.762024… 15.3866679… 0.999999972195923
## 9 "Walk-Bicycle" -2.0508057… -9.0171718… 4.91556040… 0.995461793462294
## 10 "Car, truck,\nor van-B… -21.531098… -26.243892… -16.818305… 0.00000000000417…
## # … with 35 more rows
I can also represent these differences with a box plot.
ggplot(transpo_data, aes(x = meansTW, y = JWMNP)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Means of Transportation to Work") +
scale_y_continuous(name = "Travel Time to Work in Minutes",
breaks = seq(0, 150, by = 15),
labels = paste(prettyNum(seq(0, 150, by = 15))))
Although the median travel times for cars, trucks, vans, and bicycles are lower than public transportation, they also have several outliers for longer trip times. I suspect the outlier in the walk category might be a mistake. The median travel times are highest for buses, streetcars, and subways. In the state of Colorado, which is where my sample exists, this makes sense bcause our cities are rarely built densely enough so that public transit is actually more efficient than driving.
means_rent_anova <- aov(GRNTP ~ meansTW, data = transpo_data)
summary(means_rent_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## meansTW 9 9268619 1029847 3.056 0.00117 **
## Residuals 6904 2326520636 336982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between the means of transportation to work and one’s rent. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
means_rent_differences <- TukeyHSD(means_rent_anova)
as_tibble(cbind(pair = row.names(means_rent_differences$meansTW),
means_rent_differences$meansTW))
## # A tibble: 45 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "Bus or\ntrolley bus-Bic… -205.748793… -410.89027… -0.60731075… 0.0486062845…
## 2 "Car, truck,\nor van-Bic… -65.9336261… -229.39669… 97.52944510… 0.9589329333…
## 3 "Motorcycle-Bicycle" -120.168604… -607.10051… 366.7633105… 0.9988420015…
## 4 "Other-Bicycle" -141.057948… -426.52508… 144.4091843… 0.8655538224…
## 5 "Railroad-Bicycle" 1.831395348… -360.98105… 364.6438437… 1
## 6 "Streetcar or\ntrolley c… -173.918604… -776.96625… 429.1290475… 0.9961140102…
## 7 "Subway or\nelevated-Bic… -84.8009575… -438.96019… 269.3582803… 0.9990899041…
## 8 "Taxicab-Bicycle" -49.2080783… -500.64626… 402.2301034… 0.9999988587…
## 9 "Walk-Bicycle" -180.535222… -370.27927… 9.208829015… 0.0781066851…
## 10 "Car, truck,\nor van-Bus… 139.8151672… 11.4520392… 268.1782951… 0.0202553564…
## # … with 35 more rows
I can also represent these differences with a box plot.
ggplot(transpo_data, aes(x = meansTW, y = GRNTP)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Means of Transportation to Work") +
scale_y_continuous(name = "Rent",
breaks = seq(0, 5000, by = 500),
labels = paste("$",
prettyNum(seq(0, 5000, by = 500),
big.mark = ",")))
There doesn’t appear, visually at least, to be huge differences among the medians. However, the Car, truck, or van category has the most outliers for highest rents. Interestingly the highest median rent is in the railroad category.
means_age_anova <- aov(AGEP ~ meansTW, data = transpo_data)
summary(means_age_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## meansTW 9 7412 823.6 5.082 0.000000712 ***
## Residuals 6904 1118894 162.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between the means of transportation to work and one’s age. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
means_age_differences <- TukeyHSD(means_age_anova)
as_tibble(cbind(pair = row.names(means_age_differences$meansTW),
means_age_differences$meansTW))
## # A tibble: 45 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "Bus or\ntrolley bus-Bic… 2.618765540… -1.88001128… 7.11754236… 0.7080878635…
## 2 "Car, truck,\nor van-Bic… 4.745546282… 1.160781918… 8.33031064… 0.0011722825…
## 3 "Motorcycle-Bicycle" -0.19137596… -10.8698501… 10.4870981… 1
## 4 "Other-Bicycle" 4.463337145… -1.79699063… 10.7236649… 0.4181029082…
## 5 "Railroad-Bicycle" 3.339874031… -4.61664560… 11.2963936… 0.9472629836…
## 6 "Streetcar or\ntrolley c… 4.396124031… -8.82878186… 17.6210299… 0.9889288743…
## 7 "Subway or\nelevated-Bic… 0.731418148… -7.03533556… 8.49817186… 0.9999996855…
## 8 "Taxicab-Bicycle" 0.496124031… -9.40396835… 10.3962164… 0.9999999988…
## 9 "Walk-Bicycle" 1.755599249… -2.40551010… 5.91670860… 0.9455836605…
## 10 "Car, truck,\nor van-Bus… 2.126780742… -0.68823774… 4.94179922… 0.3312297556…
## # … with 35 more rows
I can also represent this with a violin plot.
ggplot(transpo_data, aes(x = meansTW, y = AGEP)) +
geom_violin() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Means of Transportation to Work") +
scale_y_continuous(name = "Age",
breaks = seq(0, 90, by = 10),
labels = paste(prettyNum(seq(0, 90, by = 10))))
The greatest spread of ages exists in the car, truck, or van category. It seems that generally, the mean ages are higher for the streetcar, railroad, and car, truck, or van categories.
ed_veh_anova <- aov(vehicle ~ edu, data = transpo_data)
summary(ed_veh_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## edu 23 86 3.723 3.365 0.0000000933 ***
## Residuals 6890 7622 1.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between one’s educational attainment and the number of accessible vehicles. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
ed_veh_differences <- TukeyHSD(ed_veh_anova)
as_tibble(cbind(pair = row.names(ed_veh_differences$edu),
ed_veh_differences$edu))
## # A tibble: 276 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "> 1 yr of college,\nno de… 0.0070356797… -0.2020488… 0.21612024… 1
## 2 "12th grade -\nno diploma-… -0.026329787… -0.4287553… 0.37609575… 1
## 3 "Associate's degree-<1 yea… -0.062676574… -0.2978274… 0.17247427… 0.99999929…
## 4 "Bachelor's degree-<1 year… -0.076152578… -0.2740201… 0.12171499… 0.99952198…
## 5 "Doctorate degree-<1 year … -0.351584889… -0.7765900… 0.07342025… 0.28778706…
## 6 "GED-<1 year of college" 0.0004559270… -0.2801181… 0.28102997… 1
## 7 "Grade 1-<1 year of colleg… 0.0361702127… -2.6758258… 2.74816627… 1
## 8 "Grade 10-<1 year of colle… 0.1240823006… -0.3142402… 0.56240482… 0.99999776…
## 9 "Grade 11-<1 year of colle… 0.0495035460… -0.3094045… 0.40841165… 0.99999999…
## 10 "Grade 2-<1 year of colleg… 0.0361702127… -2.6758258… 2.74816627… 1
## # … with 266 more rows
I can also represent this with a box plot.
ggplot(transpo_data, aes(x = edu, y = vehicle)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Educational Attainment") +
scale_y_continuous(name = "Number of Accessible Vehicles",
breaks = seq(0, 6, by = 1),
labels = paste(prettyNum(seq(0, 6, by = 1))))
For some odd reason, people who only ever went to school through grade 4 have the highest median number of cars. According to my dataset, only 3 people in my entire sample only completed school through grade 4, so those three people alone are making it appear that people who only completed school through 4th grade might have access to more cars than most people.
ed_income_anova <- aov(PINCP ~ edu, data = transpo_data)
summary(ed_income_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## edu 23 1226400780121 53321773049 35.76 <0.0000000000000002 ***
## Residuals 6890 10275195207105 1491320059
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between one’s educational attainment and their income. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
ed_income_differences <- TukeyHSD(ed_income_anova)
as_tibble(cbind(pair = row.names(ed_income_differences$edu),
ed_income_differences$edu))
## # A tibble: 276 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "> 1 yr of college,\nno d… -2054.1360… -9730.897… 5622.6254… 0.9999992378795…
## 2 "12th grade -\nno diploma… 1795.89931… -12979.58… 16571.380… 1
## 3 "Associate's degree-<1 ye… 2074.67152… -6559.141… 10708.484… 0.9999999077892…
## 4 "Bachelor's degree-<1 yea… 14580.1551… 7315.2372… 21845.073… 0.0000000000885…
## 5 "Doctorate degree-<1 year… 25189.7386… 9585.2238… 40794.253… 0.0000012162780…
## 6 "GED-<1 year of college" -901.81983… -11203.39… 9399.7539… 1
## 7 "Grade 1-<1 year of colle… 542.595744… -99031.21… 100116.40… 1
## 8 "Grade 10-<1 year of coll… -16822.239… -32915.71… -728.7627… 0.0283260103649…
## 9 "Grade 11-<1 year of coll… -14224.537… -27402.22… -1046.845… 0.0179645130007…
## 10 "Grade 2-<1 year of colle… -17957.404… -117531.2… 81616.409… 0.9999999997555…
## # … with 266 more rows
I can represent this with a violin plot.
ggplot(transpo_data, aes(x = edu, y = PINCP)) +
geom_violin() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Educational Attainment") +
scale_y_continuous(name = "Income",
breaks = seq(0, 500000, by = 50000),
labels = paste("$",
prettyNum(seq(0, 500000, by = 50000),
big.mark = ",")))
Interestingly the No Schooling category reaches quite high, and even higher than most of the grades. It seems that outliers are pulling the mean incomes of several variables up, including those who went to college for more than a year but earned no degree, those who finished 12th grade but without a high school diploma, those who have bachelor’s degrees, high school diplomas, master’s degrees, and professional degrees.
ed_time_anova <- aov(JWMNP ~ edu, data = transpo_data)
summary(ed_time_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## edu 23 15792 686.6 1.436 0.0811 .
## Residuals 6890 3294340 478.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is greater than 0.05, we can be at least 95% confident that there is a significant association between one’s educational attainment and their travel time to work.
ed_rent_anova <- aov(GRNTP ~ edu, data = transpo_data)
summary(ed_rent_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## edu 23 109518144 4761658 14.74 <0.0000000000000002 ***
## Residuals 6890 2226271110 323116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between one’s educational attainment and their monthly rent. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
ed_rent_differences <- TukeyHSD(ed_rent_anova)
as_tibble(cbind(pair = row.names(ed_rent_differences$edu),
ed_rent_differences$edu))
## # A tibble: 276 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "> 1 yr of college,\nno d… 61.4619701… -51.53636… 174.46030… 0.9523798211348…
## 2 "12th grade -\nno diploma… 73.8333206… -143.6548… 291.32147… 0.99994012745025
## 3 "Associate's degree-<1 ye… 39.3664902… -87.71919… 166.45217… 0.9999881693356…
## 4 "Bachelor's degree-<1 yea… 230.748465… 123.81227… 337.68465… 0.0000000000034…
## 5 "Doctorate degree-<1 year… 259.066739… 29.375593… 488.75788… 0.0090706720066…
## 6 "GED-<1 year of college" -68.005153… -219.6394… 83.629186… 0.9954045416988…
## 7 "Grade 1-<1 year of colle… -243.51489… -1709.194… 1222.1649… 0.99999999995768
## 8 "Grade 10-<1 year of coll… 17.2213701… -219.6670… 254.10980… 1
## 9 "Grade 11-<1 year of coll… -20.261560… -214.2310… 173.70789… 1
## 10 "Grade 2-<1 year of colle… -283.51489… -1749.194… 1182.1649… 0.9999999989244…
## # … with 266 more rows
I can represent these differences with a box plot.
ggplot(transpo_data, aes(x = edu, y = GRNTP)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Educational Attainment") +
scale_y_continuous(name = "Rent",
breaks = seq(0, 5000, by = 500),
labels = paste("$",
prettyNum(seq(0, 5000, by = 500),
big.mark = ",")))
The lowest median rent belongs to the preschool category, but it turns out there is only one person in my dataset who only ever finished preschool. The highest median rents tend to be found in the higher education degrees: Bachelor’s, Master’s, Doctorates, and other professional degrees. So - if anything, my MUP degree means I’ll end up paying higher rent than others! I am just kidding, of course.
ed_age_anova <- aov(AGEP ~ edu, data = transpo_data)
summary(ed_age_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## edu 23 36378 1581.7 9.999 <0.0000000000000002 ***
## Residuals 6890 1089927 158.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can be 95% confident that there is a significant association between one’s educational attainment and their age. Tukey’s Honestly Significant Difference test shows us the magnitude of the differences among the different categories.
ed_age_differences <- TukeyHSD(ed_age_anova)
as_tibble(cbind(pair = row.names(ed_age_differences$edu),
ed_age_differences$edu))
## # A tibble: 276 x 5
## pair diff lwr upr `p adj`
## <chr> <chr> <chr> <chr> <chr>
## 1 "> 1 yr of college,\nno deg… -1.7614049… -4.2616454… 0.73883555… 0.626299710…
## 2 "12th grade -\nno diploma-<… 0.66816109… -4.1440575… 5.48037978… 0.999999999…
## 3 "Associate's degree-<1 year… 1.67754565… -1.1343964… 4.48948780… 0.884676287…
## 4 "Bachelor's degree-<1 year … -2.7916262… -5.1577336… -0.4255188… 0.004162250…
## 5 "Doctorate degree-<1 year o… 0.58397742… -4.4982489… 5.66620378… 1
## 6 "GED-<1 year of college" 1.66978447… -1.6853297… 5.02489873… 0.982449399…
## 7 "Grade 1-<1 year of college" -6.2425531… -38.672697… 26.1875907… 0.999999999…
## 8 "Grade 10-<1 year of colleg… -6.2700257… -11.511501… -1.0285496… 0.003224154…
## 9 "Grade 11-<1 year of colleg… -5.2158865… -9.5077222… -0.9240508… 0.002382054…
## 10 "Grade 2-<1 year of college" 15.7574468… -16.672697… 48.1875907… 0.986907355…
## # … with 266 more rows
I can represent these differences with a violin plot.
ggplot(transpo_data, aes(x = edu, y = AGEP)) +
geom_violin() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_x_discrete(name = "Educational Attainment") +
scale_y_continuous(name = "Age",
breaks = seq(0, 90, by = 10),
labels = paste(prettyNum(seq(0, 90, by = 10))))
The shape of the Grade 3 distribution is interesting to me, though it appears there are only 3 people in my data set who only ever completed Grade 3. Despite the differences in age by educational attainment, I don’t think this data necessarily highlights any interesting trends.
By running this Chi-square test, I am attempting to answer: is there a relationship between one’s sex and their means of transportation to work?
sex_means_chi_sq <- chisq.test(transpo_data$meansTW, transpo_data$SEX_label == "Male")
sex_means_chi_sq
##
## Pearson's Chi-squared test
##
## data: transpo_data$meansTW and transpo_data$SEX_label == "Male"
## X-squared = 31.544, df = 9, p-value = 0.0002387
The p-value is less than 0.05, so the relationship is significant at a 95% confidence interval. These are the values observed for each combination of variables.
sex_means_chi_sq$observed
##
## transpo_data$meansTW FALSE TRUE
## Bicycle 36 93
## Bus or\ntrolley bus 89 123
## Car, truck,\nor van 2825 3233
## Motorcycle 3 13
## Other 26 35
## Railroad 11 21
## Streetcar or\ntrolley car 2 8
## Subway or\nelevated 11 23
## Taxicab 9 10
## Walk 161 182
And here are the values we would expect if the two variables were not related.
sex_means_chi_sq$expected
##
## transpo_data$meansTW FALSE TRUE
## Bicycle 59.201186 69.798814
## Bus or\ntrolley bus 97.291872 114.708128
## Car, truck,\nor van 2780.161122 3277.838878
## Motorcycle 7.342783 8.657217
## Other 27.994359 33.005641
## Railroad 14.685566 17.314434
## Streetcar or\ntrolley car 4.589239 5.410761
## Subway or\nelevated 15.603413 18.396587
## Taxicab 8.719555 10.280445
## Walk 157.410905 185.589095
By running this Chi-square test, I am attempting to answer: is there a relationship between one’s sex and their educational attainment?
sex_edu_chi_sq <- chisq.test(transpo_data$edu, transpo_data$SEX_label == "Male")
sex_edu_chi_sq
##
## Pearson's Chi-squared test
##
## data: transpo_data$edu and transpo_data$SEX_label == "Male"
## X-squared = 82.147, df = 23, p-value = 0.00000001421
The p-value is less than 0.05, so the relationship is significant at a 95% confidence interval. These are the values observed for each combination of variables.
sex_edu_chi_sq$observed
##
## transpo_data$edu FALSE TRUE
## <1 year of college 223 247
## > 1 yr of college,\nno degree 553 614
## 12th grade -\nno diploma 32 80
## Associate's degree 307 300
## Bachelor's degree 916 918
## Doctorate degree 43 55
## GED 110 198
## Grade 1 1 1
## Grade 10 42 49
## Grade 11 58 92
## Grade 2 0 2
## Grade 3 0 3
## Grade 4 1 2
## Grade 5 1 5
## Grade 6 21 31
## Grade 7 6 9
## Grade 8 10 14
## Grade 9 23 33
## High school diploma 475 705
## Kindergarten 1 1
## Master's degree 289 284
## No schooling 16 34
## Preschool 0 1
## Professional degree 45 63
And here are the values we would expect if the two variables were not related.
sex_edu_chi_sq$expected
##
## transpo_data$edu FALSE TRUE
## <1 year of college 215.6942436 254.3057564
## > 1 yr of college,\nno degree 535.5642175 631.4357825
## 12th grade -\nno diploma 51.3994793 60.6005207
## Associate's degree 278.5668209 328.4331791
## Bachelor's degree 841.6664738 992.3335262
## Doctorate degree 44.9745444 53.0254556
## GED 141.3485681 166.6514319
## Grade 1 0.9178478 1.0821522
## Grade 10 41.7620769 49.2379231
## Grade 11 68.8385884 81.1614116
## Grade 2 0.9178478 1.0821522
## Grade 3 1.3767718 1.6232282
## Grade 4 1.3767718 1.6232282
## Grade 5 2.7535435 3.2464565
## Grade 6 23.8640440 28.1359560
## Grade 7 6.8838588 8.1161412
## Grade 8 11.0141741 12.9858259
## Grade 9 25.6997397 30.3002603
## High school diploma 541.5302285 638.4697715
## Kindergarten 0.9178478 1.0821522
## Master's degree 262.9634076 310.0365924
## No schooling 22.9461961 27.0538039
## Preschool 0.4589239 0.5410761
## Professional degree 49.5637836 58.4362164
By running this Chi-square test, I am attempting to answer: is there a relationship between one’s educational attainment and whether they take a car, truck, or van to work?
edu_means_chi_sq <- chisq.test(transpo_data$edu, transpo_data$meansTW)
edu_means_chi_sq
##
## Pearson's Chi-squared test
##
## data: transpo_data$edu and transpo_data$meansTW
## X-squared = 205.6, df = 207, p-value = 0.5145
The p-value is far greater than 0.05, so the relationship is not statistically significant at a 95% confidence interval. These are the values observed for each combination of variables.
edu_means_chi_sq$observed
## transpo_data$meansTW
## transpo_data$edu Bicycle Bus or\ntrolley bus Car, truck,\nor van
## <1 year of college 1 16 426
## > 1 yr of college,\nno degree 15 31 1034
## 12th grade -\nno diploma 2 3 100
## Associate's degree 3 8 558
## Bachelor's degree 62 66 1558
## Doctorate degree 7 4 78
## GED 3 15 270
## Grade 1 0 0 2
## Grade 10 0 2 81
## Grade 11 2 3 136
## Grade 2 0 0 2
## Grade 3 0 0 3
## Grade 4 0 0 3
## Grade 5 0 0 5
## Grade 6 0 2 47
## Grade 7 1 1 13
## Grade 8 0 0 23
## Grade 9 0 1 53
## High school diploma 12 38 1039
## Kindergarten 0 0 2
## Master's degree 20 17 482
## No schooling 0 4 42
## Preschool 0 0 1
## Professional degree 1 1 100
## transpo_data$meansTW
## transpo_data$edu Motorcycle Other Railroad
## <1 year of college 0 5 0
## > 1 yr of college,\nno degree 7 6 8
## 12th grade -\nno diploma 0 0 0
## Associate's degree 2 6 4
## Bachelor's degree 3 14 9
## Doctorate degree 0 0 2
## GED 0 3 0
## Grade 1 0 0 0
## Grade 10 0 0 0
## Grade 11 0 1 0
## Grade 2 0 0 0
## Grade 3 0 0 0
## Grade 4 0 0 0
## Grade 5 0 0 0
## Grade 6 0 1 0
## Grade 7 0 0 0
## Grade 8 0 1 0
## Grade 9 0 1 0
## High school diploma 3 16 1
## Kindergarten 0 0 0
## Master's degree 1 4 7
## No schooling 0 1 1
## Preschool 0 0 0
## Professional degree 0 2 0
## transpo_data$meansTW
## transpo_data$edu Streetcar or\ntrolley car Subway or\nelevated
## <1 year of college 1 2
## > 1 yr of college,\nno degree 0 3
## 12th grade -\nno diploma 0 1
## Associate's degree 1 2
## Bachelor's degree 5 14
## Doctorate degree 0 1
## GED 1 1
## Grade 1 0 0
## Grade 10 0 0
## Grade 11 0 1
## Grade 2 0 0
## Grade 3 0 0
## Grade 4 0 0
## Grade 5 0 0
## Grade 6 0 0
## Grade 7 0 0
## Grade 8 0 0
## Grade 9 0 0
## High school diploma 1 4
## Kindergarten 0 0
## Master's degree 0 5
## No schooling 0 0
## Preschool 0 0
## Professional degree 1 0
## transpo_data$meansTW
## transpo_data$edu Taxicab Walk
## <1 year of college 3 16
## > 1 yr of college,\nno degree 3 60
## 12th grade -\nno diploma 0 6
## Associate's degree 2 21
## Bachelor's degree 1 102
## Doctorate degree 0 6
## GED 0 15
## Grade 1 0 0
## Grade 10 0 8
## Grade 11 1 6
## Grade 2 0 0
## Grade 3 0 0
## Grade 4 0 0
## Grade 5 0 1
## Grade 6 0 2
## Grade 7 0 0
## Grade 8 0 0
## Grade 9 0 1
## High school diploma 6 60
## Kindergarten 0 0
## Master's degree 3 34
## No schooling 0 2
## Preschool 0 0
## Professional degree 0 3
And these are the expected values.
edu_means_chi_sq$expected
## transpo_data$meansTW
## transpo_data$edu Bicycle Bus or\ntrolley bus
## <1 year of college 8.76916402 14.41133931
## > 1 yr of college,\nno degree 21.77364767 35.78304889
## 12th grade -\nno diploma 2.08967313 3.43419150
## Associate's degree 11.32528204 18.61209141
## Bachelor's degree 34.21839745 56.23488574
## Doctorate degree 1.82846399 3.00491756
## GED 5.74660110 9.44402661
## Grade 1 0.03731559 0.06132485
## Grade 10 1.69785942 2.79028059
## Grade 11 2.79866937 4.59936361
## Grade 2 0.03731559 0.06132485
## Grade 3 0.05597339 0.09198727
## Grade 4 0.05597339 0.09198727
## Grade 5 0.11194677 0.18397454
## Grade 6 0.97020538 1.59444605
## Grade 7 0.27986694 0.45993636
## Grade 8 0.44778710 0.73589818
## Grade 9 1.04483656 1.71709575
## High school diploma 22.01619902 36.18166040
## Kindergarten 0.03731559 0.06132485
## Master's degree 10.69091698 17.56956899
## No schooling 0.93288979 1.53312120
## Preschool 0.01865780 0.03066242
## Professional degree 2.01504194 3.31154180
## transpo_data$meansTW
## transpo_data$edu Car, truck,\nor van Motorcycle Other
## <1 year of college 411.8108186 1.087648250 4.146658953
## > 1 yr of college,\nno degree 1022.5175007 2.700607463 10.296065953
## 12th grade -\nno diploma 98.1336419 0.259184264 0.988140006
## Associate's degree 531.8492913 1.404686144 5.355365924
## Bachelor's degree 1606.9383859 4.244142320 16.180792595
## Doctorate degree 85.8669367 0.226786231 0.864622505
## GED 269.8675152 0.712756725 2.717385016
## Grade 1 1.7523865 0.004628290 0.017645357
## Grade 10 79.7335840 0.210587214 0.802863755
## Grade 11 131.4289847 0.347121782 1.323401793
## Grade 2 1.7523865 0.004628290 0.017645357
## Grade 3 2.6285797 0.006942436 0.026468036
## Grade 4 2.6285797 0.006942436 0.026468036
## Grade 5 5.2571594 0.013884871 0.052936072
## Grade 6 45.5620480 0.120335551 0.458779288
## Grade 7 13.1428985 0.034712178 0.132340179
## Grade 8 21.0286375 0.055539485 0.211744287
## Grade 9 49.0668209 0.129592132 0.494070003
## High school diploma 1033.9080127 2.730691351 10.410760775
## Kindergarten 1.7523865 0.004628290 0.017645357
## Master's degree 502.0587214 1.326005207 5.055394851
## No schooling 43.8096616 0.115707261 0.441133931
## Preschool 0.8761932 0.002314145 0.008822679
## Professional degree 94.6288690 0.249927683 0.952849291
## transpo_data$meansTW
## transpo_data$edu Railroad Streetcar or\ntrolley car
## <1 year of college 2.175296500 0.679780156
## > 1 yr of college,\nno degree 5.401214926 1.687879664
## 12th grade -\nno diploma 0.518368528 0.161990165
## Associate's degree 2.809372288 0.877928840
## Bachelor's degree 8.488284640 2.652588950
## Doctorate degree 0.453572462 0.141741394
## GED 1.425513451 0.445472953
## Grade 1 0.009256581 0.002892682
## Grade 10 0.421174429 0.131617009
## Grade 11 0.694243564 0.216951114
## Grade 2 0.009256581 0.002892682
## Grade 3 0.013884871 0.004339022
## Grade 4 0.013884871 0.004339022
## Grade 5 0.027769743 0.008678045
## Grade 6 0.240671102 0.075209719
## Grade 7 0.069424356 0.021695111
## Grade 8 0.111078970 0.034712178
## Grade 9 0.259184264 0.080995082
## High school diploma 5.461382702 1.706682094
## Kindergarten 0.009256581 0.002892682
## Master's degree 2.652010414 0.828753254
## No schooling 0.231414521 0.072317038
## Preschool 0.004628290 0.001446341
## Professional degree 0.499855366 0.156204802
## transpo_data$meansTW
## transpo_data$edu Subway or\nelevated Taxicab Walk
## <1 year of college 2.311252531 1.291582297 23.31645936
## > 1 yr of college,\nno degree 5.738790859 3.206971362 57.89427249
## 12th grade -\nno diploma 0.550766561 0.307781313 5.55626266
## Associate's degree 2.984958056 1.668064796 30.11295921
## Bachelor's degree 9.018802430 5.039919005 90.98380098
## Doctorate degree 0.481920741 0.269308649 4.86172982
## GED 1.514608042 0.846398612 15.27972230
## Grade 1 0.009835117 0.005496095 0.09921898
## Grade 10 0.447497830 0.250072317 4.51446341
## Grade 11 0.737633787 0.412207116 7.44142320
## Grade 2 0.009835117 0.005496095 0.09921898
## Grade 3 0.014752676 0.008244142 0.14882846
## Grade 4 0.014752676 0.008244142 0.14882846
## Grade 5 0.029505351 0.016488285 0.29765693
## Grade 6 0.255713046 0.142898467 2.57969338
## Grade 7 0.073763379 0.041220712 0.74414232
## Grade 8 0.118021406 0.065953139 1.19062771
## Grade 9 0.275383280 0.153890657 2.77813133
## High school diploma 5.802719121 3.242695979 58.53919583
## Kindergarten 0.009835117 0.005496095 0.09921898
## Master's degree 2.817761065 1.574631183 28.42623662
## No schooling 0.245877929 0.137402372 2.48047440
## Preschool 0.004917559 0.002748047 0.04960949
## Professional degree 0.531096326 0.296789124 5.35782470